TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

Neural Information Processing Systems

Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences. Sign videos consist of continuous sequences of sign gestures with no clear boundaries in between. Existing SLT models usually represent sign visual features in a frame-wise manner so as to avoid explicitly segmenting the videos into isolated signs. However, these methods neglect the temporal structure of signs and introduce substantial ambiguity into translation. In this paper, we explore the temporal semantic structures of sign videos to learn more discriminative features. To this end, we first present a novel sign video segment representation that takes into account multiple temporal granularities, thus alleviating the need for accurate video segmentation. Taking advantage of the proposed segment representation, we develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet. Specifically, TSPNet introduces an inter-scale attention to evaluate and enhance local semantic consistency of sign segments, and an intra-scale attention to resolve semantic ambiguity by using non-local video context. Experiments show that our TSPNet outperforms the state of the art with significant improvements in BLEU score (from 9.58 to 13.41) and ROUGE score (from 31.80 to 34.96) on the largest commonly used SLT dataset.
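The multi-granularity segment representation described above can be sketched as sliding windows of several temporal lengths over a frame-feature sequence. The snippet below is a minimal illustration, not the paper's implementation; the window sizes (8, 12, 16) and stride (2) are assumed values, and `multi_granularity_segments` is a hypothetical helper name.

```python
import numpy as np

def multi_granularity_segments(frames, window_sizes=(8, 12, 16), stride=2):
    """Slice a (T, D) frame-feature sequence into overlapping segments
    at several temporal granularities, forming a temporal pyramid.
    Returns a dict mapping window size -> list of (w, D) segments."""
    segments = {}
    for w in window_sizes:
        starts = range(0, max(len(frames) - w + 1, 1), stride)
        segments[w] = [frames[s:s + w] for s in starts]
    return segments

# Toy example: 32 "frames", each a 4-dim feature vector.
feats = np.random.rand(32, 4)
pyramid = multi_granularity_segments(feats)
print({w: len(segs) for w, segs in pyramid.items()})  # e.g. {8: 13, 12: 11, 16: 9}
```

Because segments of different lengths overlap the same video region, no single hard segmentation boundary has to be committed to; downstream attention can weigh the competing granularities instead.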


Review for NeurIPS paper: TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

Neural Information Processing Systems

Weaknesses:
W1 The submission claims that existing approaches only capture spatial appearance (line 42), but the compared method [2] is actually based on RNNs, which have the potential to capture motion information across a sequence of frames.
W2 While the work acknowledges the challenges of motion blur and fine-grained gesture details (line 40), it does not address them in the proposed approach.
W3 The quantitative gains in terms of BLEU (9.58 to 13.41) and ROUGE (31.80 to 34.96) scores are not outstanding.
W4 The results of [2] obtained by exploiting the glosses available in the dataset are better than the ones in this submission. Given that the contributions of the work address the visual representation, it is not argued why the proposed techniques are not also assessed with the Sign-to-Gloss-to-Text setup considered in [2].


Review for NeurIPS paper: TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

Neural Information Processing Systems

The reviewers were positive about the ideas in the paper and mostly debated the merits of the evaluation. For one, they were not fully convinced by the rebuttal's arguments about the differences in boundary sharpness between action localization and sign language translation. For the camera-ready version I would suggest better addressing this point, as well as comparing with, or justifying the differences to, "Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation", Camgoz et al., CVPR 2020. One final suggestion is to add results with one more video encoder in addition to I3D.
